review comment
SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning
Liu, Fang, Liu, Simiao, Zhu, Yinghao, Lian, Xiaoli, Zhang, Li
Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates' code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, and their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs' ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset and our proposed secure-aware fine-tuning strategy, we fine-tune LLMs to generate code review comments that effectively identify security issues and provide fix suggestions. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.05)
- Asia > China (0.04)
- North America > United States > New York > New York County > New York City (0.04)
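The SecureReviewer abstract above says that retrieval-augmented generation grounds review comments in domain-specific security knowledge. The following is a minimal sketch of that general idea only, not the authors' pipeline: the knowledge entries, the Jaccard token-overlap ranking, the prompt layout, and the names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are all assumptions made for illustration.

```python
import re

# Hypothetical knowledge base: these entries are illustrative, not taken from the paper.
KNOWLEDGE_BASE = [
    "SQL injection: build queries with parameterized statements instead of string concatenation.",
    "Path traversal: normalize file paths and reject input that escapes the allowed directory.",
    "Hard-coded credentials: load secrets from a secret store, never from source code.",
]


def tokenize(text):
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def retrieve(code_change, knowledge_base, top_k=1):
    """Rank knowledge entries by Jaccard token overlap with the code change.

    Jaccard overlap is a stand-in for whatever retriever a real system would use.
    """
    query = tokenize(code_change)

    def score(entry):
        entry_tokens = tokenize(entry)
        return len(query & entry_tokens) / len(query | entry_tokens)

    return sorted(knowledge_base, key=score, reverse=True)[:top_k]


def build_prompt(code_change, knowledge_base):
    """Prepend the retrieved knowledge to the code change (assumed prompt layout)."""
    context = "\n".join(retrieve(code_change, knowledge_base))
    return (
        f"Security knowledge:\n{context}\n\n"
        f"Code change:\n{code_change}\n\n"
        "Review comment:"
    )
```

The prompt returned by `build_prompt` would then be passed to a fine-tuned model; that call is omitted here because the abstract does not describe the model interface.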
Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects
Begolli, Igli, Aksoy, Meltem, Neider, Daniel
Code review is essential for maintaining software quality but is often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents an empirical evaluation of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.
- Asia > Macao (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.06)
MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
Gao, Xian, Ruan, Jiacheng, Zhang, Zongyun, Gao, Jingsheng, Liu, Ting, Fu, Yuzhuo
With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach
Yu, Yongda, Shi, Guohao, Wu, Xianwei, He, Haochuan, Gu, XueMing, Zhao, Qianqian, Liu, Kui, Wang, Qiushi, Tian, Zhao, Shen, Haifeng, Rong, Guoping
Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.
- Asia > China > Jiangsu Province > Nanjing (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Oceania > Australia (0.04)
The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
Sadallah, Abdelrahman, Baumgärtner, Tim, Gurevych, Iryna, Briscoe, Ted
Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > Singapore (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Previously on... Automating Code Review
Heumüller, Robert, Ortmeier, Frank
Modern Code Review (MCR) is a standard practice in software engineering, yet it demands substantial time and resource investments. Recent research has increasingly explored automating core review tasks using machine learning (ML) and deep learning (DL). As a result, there is substantial variability in task definitions, datasets, and evaluation procedures. This study provides the first comprehensive analysis of MCR automation research, aiming to characterize the field's evolution, formalize learning tasks, highlight methodological challenges, and offer actionable recommendations to guide future research. Focusing on the primary code review tasks, we systematically surveyed 691 publications and identified 24 relevant studies published between May 2015 and April 2024. Each study was analyzed in terms of tasks, models, metrics, baselines, results, validity concerns, and artifact availability. In particular, our analysis reveals significant potential for standardization, including 48 task-metric combinations, 22 of which were unique to their original paper, and limited dataset reuse. We highlight challenges and derive concrete recommendations for issues such as the temporal bias threat, which has rarely been addressed so far. Our work contributes to a clearer overview of the field, supports the framing of new research, helps to avoid pitfalls, and promotes greater standardization in evaluation practices.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > Germany > Saxony-Anhalt > Magdeburg (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.93)
Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification
Nguyen, Linh, Liu, Chunhua, Lin, Hong Yi, Thongtanunam, Patanamon
Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs in classifying 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to the small number of training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.
Index Terms: code review, review comment classification, prompt engineering, large language models.
Code Review (CR) is a practice in software development where developers review other developers' code changes asynchronously to find defects and suggest improvements [1]. Acting as a quality assurance gateway, CR has become mandatory in many prominent organizations, with developers reportedly spending 10-15% of their time on this task [2]. In practice, various types of concerns can be raised in CR comments, ranging from code styling to functional issues. As comments often trigger the improvements of code changes, the types of comments play a crucial role in the quality of CR. Constructive and actionable comments addressing quality-improving issues would positively contribute to CR's overall quality and code changes [3]-[5]. On the other hand, trivial or irrelevant comments can waste developers' time without improving the code changes [6].
- North America > United States (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Canada > Ontario > Toronto (0.04)
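The review comment classification entry above describes prompting an LLM to assign each comment to one of 17 categories. As a rough sketch of how such a prompt might be assembled and the reply mapped back to a label, consider the following. The category names are hypothetical placeholders (the excerpt does not list the 17 categories), and `build_classification_prompt` and `parse_label` are illustrative helpers, not the authors' prompts or code.

```python
# Hypothetical category names: the study uses 17 categories that this excerpt does not list.
CATEGORIES = ["functional defect", "code style", "documentation", "other"]


def build_classification_prompt(comment, categories):
    """Assemble a single-label classification prompt for one review comment."""
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(categories, start=1))
    return (
        "Classify the code review comment into exactly one category.\n"
        f"Categories:\n{options}\n"
        f"Comment: {comment}\n"
        "Answer with the category name only."
    )


def parse_label(model_reply, categories, fallback="other"):
    """Map a free-text model reply to the first category name it mentions."""
    reply = model_reply.strip().lower()
    for name in categories:
        if name in reply:
            return name
    # Replies that name no known category fall back to a default label (an assumption).
    return fallback
```

The string passed to `parse_label` would come from whichever LLM is being evaluated; that call is left out because it depends on the provider.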
AI-Assisted Fixes to Code Review Comments at Scale
Maddila, Chandra, Ghorbani, Negar, Saindon, James, Thakkar, Parth, Murali, Vijayaraghavan, Abreu, Rui, Shen, Jingyue, Zhou, Brian, Nagappan, Nachiappan, Rigby, Peter C.
Aim. There are tens of thousands of code review comments each week at Meta. We developed Metamate for Code Review (MetaMateCR), which provides AI-assisted fixes for reviewer comments in production at scale. Method. We developed an internal benchmark of 64k
- North America > United States > California > San Mateo County > Menlo Park (0.04)
- North America > Canada (0.04)
- Research Report > Strength High (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Leveraging Reward Models for Guiding Code Review Comment Generation
Sghaier, Oussama Ben, Tufano, Rosalia, Bavota, Gabriele, Sahraoui, Houari
Code review is a crucial component of modern software development, involving the evaluation of code quality, providing feedback on potential issues, and refining the code to address identified problems. Despite these benefits, code review can be rather time consuming, and influenced by subjectivity and human factors. For these reasons, techniques to (partially) automate the code review process have been proposed in the literature. Among those, the ones exploiting deep learning (DL) are able to tackle the generative aspect of code review, by commenting on a given code as a human reviewer would do (i.e., comment generation task) or by automatically implementing code changes required to address a reviewer's comment (i.e., code refinement task). In this paper, we introduce CoRAL, a deep learning framework automating review comment generation by exploiting reinforcement learning with a reward mechanism considering both the semantics of the generated comments as well as their usefulness as input for other models automating the code refinement task. The core idea is that if the DL model generates comments that are semantically similar to the expected ones or can be successfully implemented by a second model specialized in code refinement, these comments are likely to be meaningful and useful, thus deserving a high reward in the reinforcement learning framework. We present both quantitative and qualitative comparisons between the comments generated by CoRAL and those produced by the latest baseline techniques, highlighting the effectiveness and superiority of our approach.
- Europe > Switzerland (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- South America > Uruguay > Maldonado > Maldonado (0.04)
- Research Report > New Finding (0.68)
- Research Report > Promising Solution (0.46)
- Leisure & Entertainment > Games > Computer Games (0.46)
- Health & Medicine > Consumer Health (0.34)
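The CoRAL entry above describes a reward that is high when a generated comment is semantically similar to the expected one or can be successfully implemented by a code refinement model. A minimal sketch of that either/or rule follows. It rests on several assumptions: token-level F1 stands in for a semantic similarity model, the outcome of the refinement model is passed in as a boolean, the reward is binary, and the threshold of 0.5 is arbitrary. None of these details come from the paper.

```python
from collections import Counter


def token_f1(generated, reference):
    """Token-overlap F1 between two comments (a stand-in for semantic similarity)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    if not gen or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def comment_reward(generated, reference, refinement_succeeded, threshold=0.5):
    """Return 1.0 if the comment is similar enough to the reference OR the
    (hypothetical) refinement model managed to implement it, else 0.0."""
    similar = token_f1(generated, reference) >= threshold
    return 1.0 if (similar or refinement_succeeded) else 0.0
```

In a reinforcement learning loop, a value like this would be computed per generated comment and used to weight the policy update; a graded reward could be substituted for the binary one.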